Conversation
Introduces CodecChain, a frozen dataclass that chains array-array, array-bytes, and bytes-bytes codecs with synchronous encode/decode methods. Pure compute only -- no IO, no threading, no batching. Also adds sync roundtrip tests for individual codecs (blosc, gzip, zstd, crc32c, bytes, transpose, vlen) and CodecChain integration tests.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
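To make the description above concrete, here is a minimal, hypothetical sketch of the idea: a frozen dataclass that composes codecs and applies them synchronously, with no IO or orchestration. The names (`SyncCodec`, `GzipLike`) and method signatures are illustrative stand-ins, not zarr's actual API.

```python
import zlib
from dataclasses import dataclass
from typing import Protocol


class SyncCodec(Protocol):
    """Hypothetical protocol: any codec with sync encode/decode."""

    def encode_sync(self, data: bytes) -> bytes: ...
    def decode_sync(self, data: bytes) -> bytes: ...


class GzipLike:
    """Stand-in bytes->bytes codec backed by zlib."""

    def encode_sync(self, data: bytes) -> bytes:
        return zlib.compress(data)

    def decode_sync(self, data: bytes) -> bytes:
        return zlib.decompress(data)


@dataclass(frozen=True)
class CodecChain:
    codecs: tuple[SyncCodec, ...]

    def encode(self, data: bytes) -> bytes:
        # Pure compute: apply each codec in order.
        for codec in self.codecs:
            data = codec.encode_sync(data)
        return data

    def decode(self, data: bytes) -> bytes:
        # Invert the chain by applying codecs in reverse order.
        for codec in reversed(self.codecs):
            data = codec.decode_sync(data)
        return data


chain = CodecChain(codecs=(GzipLike(),))
payload = b"hello" * 100
assert chain.decode(chain.encode(payload)) == payload
```

An empty chain is the identity transform, which falls out of the loop structure for free.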
updating this PR to blend the CodecChain logic with array_spec logic to define a …
Force-pushed from c0be8be to 71a780b
# Conflicts:
#   src/zarr/abc/store.py
#   src/zarr/storage/_common.py
#   src/zarr/storage/_local.py
#   src/zarr/testing/store.py
#   tests/test_codecs/test_zstd.py
this PR is necessary for the broader performance plan because it allows us to pipeline codecs that support synchronous execution (i.e., all of them, practically).
@TomAugspurger any interest in reviewing this PR? It's part of a series that culminates in the performance improvements evident in #3719
TomAugspurger left a comment:
Big +1 to the general design and the implementation looks nice. Just minor nitpicks that can be addressed / ignored as you prefer.
src/zarr/core/codec_pipeline.py (outdated diff):

    ArrayArrayCodec transforms — i.e. the spec that feeds the
    ArrayBytesCodec.

    All codecs must implement ``SupportsSyncCodec``. Construction will
Idle thought without having yet looked at the implementation: make Codec generic over SupportsSyncCodec (via a protocol) so that this can be caught before runtime?
I want to avoid touching the codec ABC, so maybe we defer this for a later effort
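The reviewer's idea could look roughly like the following: a `runtime_checkable` protocol lets construction reject codecs lacking sync methods at runtime, while annotating with the protocol lets a type checker flag them statically. All names here are a hypothetical sketch, not zarr's actual codec ABC.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsSyncCodec(Protocol):
    """Hypothetical protocol for codecs offering sync encode/decode."""

    def _encode_sync(self, data: object) -> object: ...
    def _decode_sync(self, data: object) -> object: ...


class GoodCodec:
    def _encode_sync(self, data: object) -> object:
        return data

    def _decode_sync(self, data: object) -> object:
        return data


class AsyncOnlyCodec:
    async def encode(self, data: object) -> object:
        return data


# isinstance on a runtime_checkable Protocol checks method presence,
# so a constructor can validate its codecs before any work happens.
assert isinstance(GoodCodec(), SupportsSyncCodec)
assert not isinstance(AsyncOnlyCodec(), SupportsSyncCodec)
```

Note that `isinstance` against a protocol only checks that the methods exist, not their signatures; the static checker covers the rest.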
src/zarr/core/codec_pipeline.py (outdated diff):

    self._bb_codecs = bb

    @property
    def shape(self) -> tuple[int, ...]:
How are shape and dtype ultimately used? They're a bit complicated to understand. Presumably you need them for your full perf PR, but I wanted to confirm that.
I'm on the fence about these attributes actually. I wanted to model the fact that the output of a ChunkTransform is an array, with a fixed shape and dtype, and the fact that the array -> array codecs can be thought of as layers of ChunkTransform objects. But I don't know if we actually need this. Maybe a richer return type annotation is a better way of conveying this information.
these attributes are gone, we can add them later if they are actually valuable
src/zarr/core/codec_pipeline.py (outdated diff):

    """
    bb_out: Any = chunk_bytes
    for bb_codec in reversed(self._bb_codecs):
        bb_out = bb_codec._decode_sync(bb_out, self._ab_spec)  # type: ignore[attr-defined]
Can you add a comment about why the type: ignore is needed? Presumably it relates to that isinstance(c, SupportsSyncCodec) check above, which mypy can't see here in decode?
the type: ignore is gone thanks to making the protocol generic.
src/zarr/core/codec_pipeline.py (outdated diff):

    for aa_codec, spec in reversed(self._aa_codecs):
        ab_out = aa_codec._decode_sync(ab_out, spec)  # type: ignore[attr-defined]

    return ab_out  # type: ignore[no-any-return]
This type: ignore I don't understand. Is the Any up above accurate, or should this be NDBuffer | Buffer like I see here? And if it's supposed to be NDBuffer | Buffer, why do we declare we return -> NDBuffer?
this was because the protocol for synchronous encoding / decoding was not generic over input and output types. I fixed that, so the type: ignore statement is gone
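The fix described above can be sketched like this: making the sync-codec protocol generic over input and output types means the checker can follow the loop's types, so no `type: ignore` or cast is needed. Names and type parameters here are illustrative, not zarr's.

```python
from typing import Protocol, TypeVar

# Input is contravariant (a codec accepting a broader input still fits);
# output is covariant (a codec returning a narrower output still fits).
In = TypeVar("In", contravariant=True)
Out = TypeVar("Out", covariant=True)


class SupportsSyncDecode(Protocol[In, Out]):
    """Hypothetical generic protocol for sync decoding."""

    def _decode_sync(self, data: In) -> Out: ...


class BytesToText:
    """Stand-in codec: decodes raw bytes to a str."""

    def _decode_sync(self, data: bytes) -> str:
        return data.decode("utf-8")


def decode_with(codec: SupportsSyncDecode[bytes, str], raw: bytes) -> str:
    # The checker knows the return type is str; no ignore needed.
    return codec._decode_sync(raw)


assert decode_with(BytesToText(), b"abc") == "abc"
```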
tests/test_sync_codec_pipeline.py (outdated diff):

    def test_encode_decode_roundtrip_bytes_only(self) -> None:
        # Minimal round-trip: BytesCodec serializes the array to bytes and back.
        # No compression, no AA transform.
        arr = np.arange(100, dtype="float64")
        spec = _make_array_spec(arr.shape, arr.dtype)
        chain = ChunkTransform(codecs=(BytesCodec(),), array_spec=spec)
        nd_buf = _make_nd_buffer(arr)
FWIW, I'd be fine with consolidating these tests with the construction tests. I do like having focused construction tests when the constructors are complicated, but these seem simple enough that just seeing the traceback pointing to __post_init__ should be enough. Either works for me though.
good idea. I kept the encoding / decoding separate from the constructor tests, but I also made the constructor tests much more compact.
tests/test_sync_codec_pipeline.py (outdated diff):

    encoded = chain.encode(nd_buf)
    assert encoded is not None
    decoded = chain.decode(encoded)
    np.testing.assert_array_equal(arr, decoded.as_numpy_array())
Might be worth refactoring this to a test helper, assert_encode_decode_equal(...).
Also, not worth worrying about currently, arr is possibly not a NumPy array, depending on what default_buffer_prototype() returns, in which case np.testing.assert_array_equal might not work. But to solve that more generally is out of scope.
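The suggested helper could look roughly like this. `FakeChain` is a hypothetical stand-in here so the sketch is self-contained; the real helper would take a ChunkTransform and an NDBuffer.

```python
import zlib
from typing import Optional


class FakeChain:
    """Stand-in for a codec chain with the same encode/decode shape."""

    def encode(self, data: bytes) -> Optional[bytes]:
        return zlib.compress(data)

    def decode(self, data: bytes) -> bytes:
        return zlib.decompress(data)


def assert_encode_decode_equal(chain: FakeChain, data: bytes) -> None:
    # One helper captures the encode -> not-None -> decode -> compare
    # pattern repeated across the round-trip tests.
    encoded = chain.encode(data)
    assert encoded is not None
    decoded = chain.decode(encoded)
    assert decoded == data


assert_encode_decode_equal(FakeChain(), b"roundtrip" * 10)
```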
src/zarr/core/codec_pipeline.py (outdated diff):

    if aa_out is None:
        return None
    bb_out: Any = self._ab_codec._encode_sync(aa_out, self._ab_spec)  # type: ignore[attr-defined]
This is quite hard to read/review: "is the output of aa after application of ab really bb?!" Doesn't help that output is Any despite all the fancy typing.
aa_out -> asarray; bb_out -> asbytes?
ok so it's an issue with the variable names, I will see if I can make them more clear
the properties are an issue too but I don't have a suggestion for that.
[points accusingly at the v3 spec] making 3 different kinds of functions all "codecs" was maybe not the best choice. We could use the "filters, serializer, compressors" trinity used in create_array, but I worry that this is a bit too far from the language used by the spec.
I'm going to punt on this for now. I legit think the naming issues here are a basic flaw in the v3 spec.
src/zarr/core/codec_pipeline.py (outdated diff):

    if bb_out is None:
        return None
could be hoisted out of the loop
or maybe not? I'm confused regardless.
the whole thing with encoding potentially returning none is problematic and needs to go. Unfortunately that's based on the codec ABC which I am not touching right now :(
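The hoisting question above can be sketched in isolation. If `None` can only enter the loop through a codec's return value, a per-iteration check is equivalent to checking once after the loop; if the initial input may be `None`, the check belongs before the loop. Everything here is hypothetical scaffolding, not zarr's actual encode path.

```python
from collections.abc import Callable, Iterable
from typing import Optional


def encode_all(
    codecs: Iterable[Callable[[bytes], Optional[bytes]]],
    data: Optional[bytes],
) -> Optional[bytes]:
    for codec in codecs:
        # Checking inside the loop short-circuits as soon as any codec
        # (or the initial input) produced None...
        if data is None:
            return None
        data = codec(data)
    # ...but if codecs never return None, a single check hoisted out of
    # the loop is equivalent and easier to read.
    return data
```

This is exactly why an encode API that can return `None` is awkward: every caller inherits the ambiguity.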
src/zarr/core/codec_pipeline.py (outdated diff):

    if aa_out is None:
        return None
could be moved out of the loop? Can _encode_sync return None?
Can _encode_sync return None?
unfortunately yes, in keeping with methods already defined on the Codec API.
thanks for the super helpful feedback @TomAugspurger and @dcherian! I made a lot of simplifying changes. LMK if anything else needs to be done, otherwise I will merge and move on to the next step of the performance work.
This PR defines a CodecChain object, which is similar to CodecPipeline except that it performs no orchestration. It just defines array-spec-aware, pure-compute encoding and decoding routines for collections of codecs. Part of #3720. Depends on #3721.

edit: CodecChain is now called ChunkTransform